Downstream Analysis of Synteny Networks with MSA2dist

Kristian K Ullrich

Max Planck Institute for Evolutionary Biology

Background - DNA Evolution

A matter of distance.

Background - Coding Sequences

Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.

Typically used to:

  1. Detect and Date whole-genome duplication (WGD) events;
  2. Predict selective pressure acting on a protein;
  3. Predict selective pressure acting on a codon;
  4. Investigate codon usage (R/Bioconductor package coRdon).

MSA2dist

An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.

MSA2dist Features:

  1. Calculates pairwise AA, DNA and IUPAC distances;
  2. Calculates Synonymous and NonSynonymous Substitutions (dN/dS) KaKs_Calcultor 2.0 models (C++ ported to R with Rcpp);
  3. Use pre-calculated MSA or conduct all possible pairwise alignments prior dN/dS calculation;
  4. Calculate average behavior of each codon from MSA.

MSA2dist workflow

Installation

From Bioconductor:

BiocManager::install(version='devel')
BiocManager::install("MSA2dist")


From GitHub:

remotes::install_github("kullrich/MSA2dist")

Example data set

Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.

Source: Pico-PLAZA 3.0

Where to find: the data/ directory in the GitHub repo associated with this presentation.

data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
  1. codingsequences.rda: DNAStringSet object.
  2. clusters.rda: Pre-calculated syntenet clusters.

Example data set: codingsequences

# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(dplyr)
library(stringr)

# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))

# Inspect data
head(names(codingsequences))
[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"

Example data set: clusters

# Load data set clusters
load(here::here("data", "clusters.rda"))

# Inspect data
head(clusters)
                   Gene Cluster
1 Chlo_CNC64A_028G00030       1
2 Chlo_CNC64A_028G00040       2
3 Chlo_CNC64A_028G00070       3
4 Chlo_CNC64A_028G00080       4
5 Chlo_CNC64A_028G00110       5
6 Chlo_CNC64A_028G00140       6
length(table(clusters$Cluster))
[1] 22605

Fetch coding sequences for a specific cluster

# Get size distribution of clusters
head(table(clusters$Cluster))

 1  2  3  4  5  6 
38 38 37  2  9  4 
# Get one random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
    Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster)
                      Gene Cluster
1 Bpra_BPRRCC1105_18G01170   15993
2 Bpra_BPRRCC1105_18G01180   15993
3     Osp_ORCC809_11G01220   15993
4          Oluc_OL11G00350   15993
5         Omed_OM_11G00390   15993
6         Omed_OM_11G00400   15993

Fetch coding sequences for a specific cluster

# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
my_cluster
                       Gene Cluster              GeneID
1  Bpra_BPRRCC1105_18G01170   15993 BPRRCC1105_18G01170
2  Bpra_BPRRCC1105_18G01180   15993 BPRRCC1105_18G01180
3      Osp_ORCC809_11G01220   15993    ORCC809_11G01220
4           Oluc_OL11G00350   15993          OL11G00350
5          Omed_OM_11G00390   15993         OM_11G00390
6          Omed_OM_11G00400   15993         OM_11G00400
7      Msp_MRCC299_12G04700   15993    MRCC299_12G04700
8           Mpus_MP10G00340   15993          MP10G00340
9          Otau_OT_11G00400   15993         OT_11G00400
10         Otau_OT_11G00390   15993         OT_11G00390
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID, 
    names(codingsequences))]
my_cds
DNAStringSet object of length 10:
     width seq                                              names               
 [1]  2859 ATGAAGCCAAAAGTTTCCAAATA...AGAAAATGTAGAAAGAGATTAG BPRRCC1105_18G01170
 [2]  1350 ATGAGTTCAATTACGTGTAATAC...TAGGTGTTTCCAAGGATTCTAG BPRRCC1105_18G01180
 [3]  2829 ATGTCTAGAGCGACGCGTTTGGA...CGAGAGAAGGGAGAACCCGTAG ORCC809_11G01220
 [4]  2985 ATGGAGGCGGACGACGCGGCGGC...AAGCGCCCCAAAGCGCACTTAG OL11G00350
 [5]  1371 ATGAAGCGCACCAGGAGCGACGA...TGCCTGCTTTCAAGGATTTTAA OM_11G00390
 [6]  2079 ATGCCGTCGCCCGCACCGCGCGC...CGCTCGAGCCGCCGCTCCGTAA OM_11G00400
 [7]  1941 ATGGACGAGCTGGGTGGAGGCTG...CGATCGATGCTTCATGTACTGA MRCC299_12G04700
 [8]  1509 ATGGGCGGCGACGACGCGACCTC...CGACGCGTGCTTCATGTTCTGA MP10G00340
 [9]  1950 ATGTCCACGAGCGAGAATTCTCC...CGGACGCCACTGCGACCGCTAA OT_11G00400
[10]  1152 ATGGCGTCGACGGGCGTGGACGA...TGAATGTTTCCAGGGATTCTAA OT_11G00390

Coding sequence alignment

To calculate a coding sequence alignment for two sequences:

# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_aln
DNAStringSet object of length 2:
    width seq                                               names               
[1]  2907 ATGAAGCCAAAAGTTTCCAAATA...AAGAAAATGTAGAAAGAGATTAG BPRRCC1105_18G01170
[2]  2907 ATG--------------------...--------------------TAG BPRRCC1105_18G01180

To calculate dN/dS from this alignment:

# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
    model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks
  Comp1 Comp2                seq1                seq2       ka      ks
1     1     2 BPRRCC1105_18G01170 BPRRCC1105_18G01180 0.721665 2.35973
         vka       vks
1 0.00374533 0.4552966

Parallelized pairwise coding sequence alignments

To calculate all pairwise combinations:

start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
    model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()

head(my_cds_kaks)
         Comp1 Comp2                seq1                seq2 Method       Ka
result.1     1     2 BPRRCC1105_18G01170 BPRRCC1105_18G01180     YN 0.769457
result.2     1     3 BPRRCC1105_18G01170    ORCC809_11G01220     YN 0.493889
result.3     1     4 BPRRCC1105_18G01170          OL11G00350     YN 0.504261
result.4     1     5 BPRRCC1105_18G01170         OM_11G00390     YN 0.916875
result.5     1     6 BPRRCC1105_18G01170         OM_11G00400     YN 0.528609
result.6     1     7 BPRRCC1105_18G01170    MRCC299_12G04700     YN 0.869959
              Ks    Ka/Ks P-Value(Fisher) Length S-Sites N-Sites
result.1 2.61875 0.293826     4.60612e-14   1299 284.142 1014.86
result.2 4.50708 0.109581     1.64868e-70   1803 407.257 1395.74
result.3 4.54449 0.110961      2.3001e-66   1872 428.082 1443.92
result.4 4.26276  0.21509     1.06185e-13   1257 294.029 962.971
result.5  4.5556 0.116035     4.22173e-70   1857  434.47 1422.53
result.6 4.40849 0.197337     3.79844e-39   1644 357.089 1286.91
         Fold-Sites(0:2:4) Substitutions S-Substitutions N-Substitutions
result.1                NA           691         206.618         484.382
result.2                NA           848         344.128         503.872
result.3                NA           882         353.342         528.658
result.4                NA           731         224.582         506.418
result.5                NA           906         367.581         538.419
result.6                NA           973         312.786         660.214
         Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1                          NA                          NA
result.2                          NA                          NA
result.3                          NA                          NA
result.4                          NA                          NA
result.5                          NA                          NA
result.6                          NA                          NA
         Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1         1.17397                              1.24685:1.24685:1:1:1:1
result.2         1.40038                                1.1745:1.1745:1:1:1:1
result.3         1.42817                              1.16575:1.16575:1:1:1:1
result.4         1.69952                              1.07489:1.07489:1:1:1:1
result.5         1.47078                            0.907996:0.907996:1:1:1:1
result.6         1.63855                              1.17216:1.17216:1:1:1:1
                                    GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1  0.349329(0.376677:0.27967:0.391641)       NA   NA            NA    NA
result.2 0.415121(0.440495:0.330757:0.474111)       NA   NA            NA    NA
result.3 0.422399(0.447846:0.334089:0.485261)       NA   NA            NA    NA
result.4 0.374579(0.407071:0.305051:0.411616)       NA   NA            NA    NA
result.5  0.429662(0.47807:0.353314:0.457602)       NA   NA            NA    NA
result.6 0.439264(0.455756:0.354424:0.507612)       NA   NA            NA    NA

How long did it take?

end_time - start_time
Time difference of 10.0548 secs

Calculate average behavior of each codon

Here, a pre-calculated MSA is necessary:

# load example data
data(hiv, package="MSA2dist")

# calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon:

Here’s where you can find me: